4 research outputs found

    IDENTIFYING AND CHARACTERIZING TRANSPOSABLE ELEMENTS IN THE GENOME

    A large fraction of the mammalian genome consists of transposable elements (TEs). These elements are segments of DNA that either move or are copied from one place in the genome to another. TEs are a significant source of genetic variation and are directly responsible for many diseases. It is difficult to identify, map, characterize, and determine the zygosity of TEs using current high-throughput short-read sequencing data because of their numerous copies in the genome. Existing approaches search for TE insertions (TEis) by aligning millions of mostly irrelevant short reads to either a reference genome or a TE sequence library. In this dissertation I describe two novel alignment-free TEi detection algorithms, ELITE and Frontier, which outperform existing tools in several different categories. Both algorithms use local genome assembly; ELITE is template-dependent and Frontier is template-free. The key idea is to focus on identifying the boundary of a TE insertion, which contains both partial TE sequence and non-TE context. I use an msBWT-based data structure to store and index all the reads from a high-throughput sequencing dataset and leverage additional data structures, the FM-index and the Longest Common Prefix (LCP) array, to efficiently search for TEi boundaries. I show that the combination of the two methods can identify nearly all the endogenous retrovirus (ERV) insertions that are segregating in a population with more than 100 samples. These methods can also be used to identify very recent or de novo TE insertions. Moreover, characterization based on the sharing pattern of ERV insertions (ERVis) allows us to study phylogeny within a population. Doctor of Philosophy
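The indexing idea in the abstract above can be illustrated with a toy example. The following is a minimal sketch, not the ELITE or Frontier implementation: it builds a multi-string BWT by naively sorting rotations of each read (real msBWT construction is far more efficient) and uses FM-index backward search to count how many times a k-mer occurs across all reads without aligning them. The names `build_multistring_bwt` and `FMIndex`, and the example reads, are hypothetical and chosen only for illustration.

```python
from collections import Counter


def build_multistring_bwt(reads):
    """Toy multi-string BWT: sort every rotation of every read (plus '$')."""
    rotations = []
    for read in reads:
        text = read + "$"  # '$' terminates each read and sorts before A, C, G, T
        rotations.extend(text[i:] + text[:i] for i in range(len(text)))
    rotations.sort()
    # The BWT is the last symbol of each sorted rotation.
    return "".join(rot[-1] for rot in rotations)


class FMIndex:
    """Illustrative FM-index over a BWT string (hypothetical helper class)."""

    def __init__(self, bwt):
        self.bwt = bwt
        counts = Counter(bwt)
        # first[c]: number of symbols in the collection that sort before c.
        self.first = {}
        total = 0
        for c in sorted(counts):
            self.first[c] = total
            total += counts[c]
        # occ[c][i]: number of occurrences of c in bwt[:i].
        self.occ = {c: [0] * (len(bwt) + 1) for c in counts}
        for i, sym in enumerate(bwt):
            for c in self.occ:
                self.occ[c][i + 1] = self.occ[c][i] + (sym == c)

    def count(self, pattern):
        """Count occurrences of pattern across all reads by backward search."""
        lo, hi = 0, len(self.bwt)
        for c in reversed(pattern):
            if c not in self.first:
                return 0
            lo = self.first[c] + self.occ[c][lo]
            hi = self.first[c] + self.occ[c][hi]
            if lo >= hi:
                return 0
        return hi - lo


# Hypothetical reads; a real dataset would hold millions of short reads.
reads = ["ACGTAC", "GTACGG", "TTACGT"]
index = FMIndex(build_multistring_bwt(reads))
print(index.count("TAC"))
```

In a boundary search of the kind the abstract describes, a query like this could be issued for k-mers taken from the end of a known TE sequence, and the reads containing them inspected for non-TE flanking context; that step is not shown here.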

    Genomes of the Mouse Collaborative Cross.

    The Collaborative Cross (CC) is a multiparent panel of recombinant inbred (RI) mouse strains derived from eight founder laboratory strains. RI panels are popular because of their long-term genetic stability, which enhances reproducibility and integration of data collected across time and conditions. Characterization of their genomes can be a community effort, reducing the burden on individual users. Here we present the genomes of the CC strains using two complementary approaches as a resource to improve power and interpretation of genetic experiments. Our study also provides a cautionary tale regarding the limitations imposed by such basic biological processes as mutation and selection. A distinct advantage of inbred panels is that genotyping only needs to be performed on the panel, not on each individual mouse. The initial CC genome data were haplotype reconstructions based on dense genotyping of the most recent common ancestors (MRCAs) of each strain followed by imputation from the genome sequence of the corresponding founder inbred strain. The MRCA resource captured segregating regions in strains that were not fully inbred, but it had limited resolution in the transition regions between founder haplotypes, and there was uncertainty about founder assignment in regions of limited diversity. Here we report the whole genome sequence of 69 CC strains generated by paired-end short reads at 30× coverage of a single male per strain. Sequencing leads to a substantial improvement in the fine structure and completeness of the genomes of the CC. Both MRCAs and sequenced samples show a significant reduction in the genome-wide haplotype frequencies from two wild-derived strains, CAST/EiJ and PWK/PhJ. In addition, analysis of the evolution of the patterns of heterozygosity indicates that selection against three wild-derived founder strains played a significant role in shaping the genomes of the CC. 
The sequencing resource provides the first description of tens of thousands of new genetic variants introduced by mutation and drift in the CC genomes. We estimate that new SNP mutations are accumulating in each CC strain at a rate of 2.4 ± 0.4 per gigabase per generation. The fixation of new mutations by genetic drift has introduced thousands of new variants into the CC strains. The majority of these mutations are novel compared to currently sequenced laboratory stocks and wild mice, and some are predicted to alter gene function. Approximately one-third of the CC inbred strains have acquired large deletions (>10 kb), many of which overlap known coding genes and functional elements. The sequence of these mice is a critical resource to CC users, increases threefold the number of mouse inbred strain genomes available publicly, and provides insight into the effect of mutation and drift on common resources. Genetics 2017 Jun; 206(2):537-56

    The Genomes of the Collaborative Cross

    The Collaborative Cross (CC) is a multiparent recombinant inbred strain mouse panel derived from eight founder inbred strains. A distinct advantage of recombinant inbred panels is that detailed characterization of their genomes does not need to be performed by each user. Until now the CC genomes were haplotype reconstructions based on dense genotyping of the most recent common ancestors (MRCAs) of each strain followed by imputation from the genome sequence of the corresponding founder inbred strain. The MRCA resource had the advantage that it captured segregating regions in strains that were not fully inbred, but it had limited resolution in the transition regions between founder haplotypes and resulted in uncertainty about founder assignment in regions of limited diversity. Here we report the whole genome sequence of 69 CC strains generated by paired-end short reads at 30X coverage of a single male per strain. Sequencing results in a substantial improvement in the fine structure and completeness of the genomes of the CC. Both MRCAs and sequenced samples have significant reduction in the genome-wide haplotype frequencies of two of the wild-derived strains, CAST/EiJ and PWK/PhJ. In addition, analysis of the evolution of the patterns of heterozygosity indicates that selection against three wild-derived founder strains played a significant role in shaping the genomes of the CC. The sequencing resource provides the first description of tens of thousands of new genetic variants introduced by genetic drift on the CC genomes. The CC strains represent an extreme example of the principle that genetic drift is expected to have maximum impact in populations with small effective size and high level of inbreeding. We estimate that new SNP mutations are accumulating in each CC strain at a rate of 2.4 per Gb per generation. The majority of these mutations are novel compared to currently sequenced laboratory stocks and wild mice, and some are predicted to alter gene function. 
Overall, genetic drift has increased the number of variants segregating among CC strains by more than 2%. Approximately one third of the CC inbred strains have acquired large deletions (>10 kb), many of which overlap known coding genes and functional elements. In conclusion, we provide a critical resource to users of the CC, increase threefold the number of mouse inbred strain genomes available publicly, and provide a striking example of the effect of genetic drift on common resources.

    Content and Performance of the MiniMUGA Genotyping Array, a New Tool To Improve Rigor and Reproducibility in Mouse Research.

    The laboratory mouse is the most widely used animal model for biomedical research, due in part to its well annotated genome, wealth of genetic resources and the ability to precisely manipulate its genome. Despite the importance of genetics for mouse research, genetic quality control (QC) is not standardized, in part due to the lack of cost effective, informative and robust platforms. Genotyping arrays are standard tools for mouse research and remain an attractive alternative even in the era of high-throughput whole genome sequencing. Here we describe the content and performance of a new iteration of the Mouse Universal Genotyping Array, MiniMUGA, an array-based genetic QC platform with over 11,000 probes. In addition to robust discrimination between most classical and wild-derived laboratory strains, MiniMUGA was designed to contain features not available in other platforms: 1) chromosomal sex determination, 2) discrimination between substrains from multiple commercial vendors, 3) diagnostic SNPs for popular laboratory strains, 4) detection of constructs used in genetically engineered mice, and 5) an easy-to-interpret report summarizing these results. In-depth annotation of all probes should facilitate custom analyses by individual researchers. To determine the performance of MiniMUGA we genotyped 6,899 samples from a wide variety of genetic backgrounds. The performance of MiniMUGA compares favorably with three previous iterations of the MUGA family of arrays both in discrimination capabilities and robustness. We have generated publicly available consensus genotypes for 241 inbred strains including classical, wild-derived and recombinant inbred lines. Here we also report the detection of a substantial number of XO and XXY individuals across a variety of sample types, new markers that expand the utility of reduced complexity crosses to genetic backgrounds other than C57BL/6, and the robust detection of 17 genetic constructs. 
We provide preliminary evidence that the array can be used to identify both partial sex chromosome duplication and mosaicism, and that diagnostic SNPs can be used to determine how long inbred mice have been bred independently from the relevant main stock. We conclude that MiniMUGA is a valuable platform for genetic QC and an important new tool to increase the rigor and reproducibility of mouse research.